Crucial to all good statistical analyses is a thorough exploratory analysis. In this document, we will introduce a few techniques for developing an understanding of our dataset, including an understanding of its limitations.
Let’s load all the required R packages. See our tutorial 0_Installing_packages.html for installation information.
library(rgeos)
library(rgdal)
library(sp)
library(maptools)
library(spatstat)
library(spdep)
library(INLA)
library(inlabru)
library(readxl)
library(lubridate)
library(ggmap)
library(raster)
If you have not done so already, you need to download the two set of files associated with the workshop.
All the workshop script and tutorial files can be found on Github at github.com/joenomiddlename/DFO_SDM_Workshop_2020. To download these files on your computer, press on Code and then Download ZIP. Once downloaded, unzip the DFO_SDM_Workshop_2020-main.zip file. To be able to access some of the precompiled data, we need to make the unzipped folder the working directory. To do so in R Studio, navigate to the correct folder using the bottom right panel (folder view, you can use … button) and open it. Then, click “Set as Working Directory” under the tab ‘More’.
We should be inside the folder titled: ‘DFO_SDM_Workshop_2020’, you can verify this by using getwd().
Now we can load in the precompiled data
list2env(readRDS('./Data/Compiled_Data_new.rds'), globalenv())
## <environment: R_GlobalEnv>
Throughout this workshop, we use data collected on fin whales in June, July, and August across the years 2007-2009 and for 2011. There are two main types of data:
sightings from systematic aerial surveys. These aerial surveys were conducted by both NOAA and DFO. Aerial survey sightings are accompanied with the aerial tracklines. Three surveys were combined: NOAA NARW Surveys (Cole, NOAA), 2007 T-NASS Aerial Survey (Lawson et al. 2009), and NOAA Cetacean Surveys (Cole et al., NOAA).
opportunistic sightings from two whale-watching operators. These sightings are maintained in the DFO Maritimes Region Whale Sightings Database (MacDonald et. al. 2017).
These data are spatially-referenced data. The first thing we must always do is check for consistency between the coordinate reference systems (CRS) of each spatial object in use! To print the CRS of an sp object, simply add @proj4string to the name of the spatial object and run. For example, to check the CRS of the WW sightings we run:
Sightings_DRWW_sp@proj4string
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
We check the CRS of the other objects, but omit the code for brevity.
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
All of the spatial objects are in lat/lon - good! For future analysis we will be projecting the data into a different coordinate reference system to better preserve euclidean distance.
Finally, let’s turn off all warnings associated with coordinate reference systems. The latest PROJ6+/GDAL3+ updates have caused many warning messages to be printed.
rgdal::set_rgdal_show_exportToProj4_warnings(FALSE)
rgdal::set_thin_PROJ6_warnings(TRUE)
options("rgdal_show_exportToProj4_warnings"="none")
Our first goal is generally to plot the data on a map. The gg() and gmap() functions from the inlabru package will prove extremely useful at plotting spatial data! The class of spatial objects we will use are from the sp package. The classes of these objects begin with ‘Spatial’. For example SpatialPointsDataFrame.
We have written a bespoke function gg.spatiallines_mod() to easily add SpatialLinesDatFrame objects to the plots too. This will prove useful for plotting transect lines. We load the bespoke functions to the working environment now.
source('utility_functions.R')
Let’s plot our data!
If the data are in lat/lon format then the gmap() function will automatically add a terrain layer from Stamen Maps to the plots. Let’s plot the survey sightings in blue, the survey tracklines in black, the whale watch sightings in purple, the whale watch ports in red.
gmap(Sightings_survey) +
gg(Domain) +
gg.spatiallines_mod(Effort_survey) +
gg(Sightings_survey, colour='blue') +
gg(Sightings_DRWW_sp, colour='purple') +
gg(WW_ports, colour='red')
By default gmap uses Google maps (although this often changes (see ?get_map)), see Additional tips for ways to use open-source maps for publication.
To remove the map layer, simply replace the gmap(Sightings_survey) with ggplot().
ggplot() + # Notice the empty ggplot() call
gg(Domain) +
gg.spatiallines_mod(Effort_survey) +
gg(Sightings_survey, colour='blue') +
gg(Sightings_DRWW_sp, colour='purple') +
gg(WW_ports, colour='red')
This plot hides some crucial information regarding the data collection. For example, the survey sightings and tracklines do not come from a single survey, or even a single organisation! Let’s plot this! The easiest way to do this is to subset the data accordingly!
table(Effort_survey$DATASET)
##
## DFO NOAA_1 NOAA_2
## 49 54 70
# there are 3 surveys
ggplot() +
gg(Domain) +
gg.spatiallines_mod(Effort_survey[Effort_survey$DATASET=='DFO',], colour='purple') +
gg.spatiallines_mod(Effort_survey[Effort_survey$DATASET=='NOAA_1',], colour='red') +
gg.spatiallines_mod(Effort_survey[Effort_survey$DATASET=='NOAA_2',], colour='yellow')
This is problematic! The DFO tracklines (in purple) do not overlap with the two NOAA surveys! Thus, any future model will be unable to identify any differences in protocol efficiency. This is because any model intercepts will be confounded with the latent spatial field. More on this later!
In addition, the surveys were conducted across 4 separate years. Let’s plot the survey tracklines by year. Do you see any cause for concern? Note that the names of the variables in the Effort_Survey are: DATASET and YEAR. Try this on your own! If you get stuck, click ‘Show Code’. Hint: The YEAR variable is of type character and contains 4 unique values (see below).
table(Effort_survey$YEAR)
##
## 2007 2008 2009 2011
## 101 20 19 33
class(Effort_survey$YEAR)
## [1] "character"
ggplot() +
gg(Domain) +
gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR=='2007',], colour='purple') +
gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR=='2008',], colour='red') +
gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR=='2009',], colour='blue') +
gg.spatiallines_mod(Effort_survey[Effort_survey$YEAR=='2011',], colour='yellow')
## Regions defined for each Polygons
Note that the surveys from years 2007, 2008 and 2009 covered largely different regions! Again, this is problematic if we want to model any changes in the whale distribution over time! The effect of year will be confounded by the spatial field. That being said, the data from 2011 appear to be a good candidate for model comparison as the spatial range overlaps with the other 3 years’ effort. We will holdout this data as our test data and use 2007, 2008, and 2009 as our training data.
xtabs(~ YEAR + DATASET, data=Effort_survey@data)
## DATASET
## YEAR DFO NOAA_1 NOAA_2
## 2007 49 3 49
## 2008 0 20 0
## 2009 0 19 0
## 2011 0 12 21
2011’s data comes exclusively from NOAA.
For modelling, we will transform the data from lat/lon into a new “Canadian” CRS: NAD83 datum with UTM Zone 20N projection. This projection will help to preserve euclidean distance between points. The EPSG code is 2961.
To do the transformation, we will use the spTransform() function. For example, to transform the Whale Watch sightings spatial object Sightings_DRWW_sp, we first define the CRS object as follows:
Can_proj <- CRS("+init=EPSG:2961")
Can_proj <- fm_crs_set_lengthunit(Can_proj, unit='km')
The second line of code specifies that we want to work in units of km instead of the default meters. This can prove vital in applications to avoid numerical overflow.
Next, we transform Sightings_DRWW_sp:
Sightings_DRWW_sp <- spTransform(Sightings_DRWW_sp, Can_proj)
Sightings_DRWW_sp@proj4string
## CRS arguments:
## +proj=tmerc +lat_0=0 +lon_0=-63 +k=0.9996 +x_0=500000 +y_0=0
## +ellps=GRS80 +units=km +no_defs
Notice the changed output from calling @proj4string. We repeat this for all the spatial objects that are points or lines.
Sightings_survey <- spTransform(Sightings_survey, Can_proj)
Effort_survey <- spTransform(Effort_survey, Can_proj)
WW_ports <- spTransform(WW_ports, Can_proj)
Transforming the ‘raster’-like SpatialPixelsDataFrame objects (Bathym, Dist_Brier, and Dist_Quoddy) using spTransform would be inappropriate here. The projection leads to a curvature of the pixels. A more appropriate approach here is to use bilinear interpolation. The projectRaster() function from the raster package works great for this. This requires converting the SpatialPixelsDataFrame object into an object of type raster. This is made easy with the function raster(). Finally, to convert the raster object back into a SpatialPixelsDataFrame, we can use the as() function from the maptools package. This function is extremely useful for converting spatial objects between the popular packages: sp, spatstat, and sf. We use this function substantially throughout the workshop.
Bathym <- as(projectRaster(raster(Bathym), crs=Can_proj), 'SpatialPixelsDataFrame')
Domain <- spTransform(Domain, Can_proj)
Dist_Brier <- as(projectRaster(raster(Dist_Brier), crs=Can_proj), 'SpatialPixelsDataFrame')
Dist_Quoddy <- as(projectRaster(raster(Dist_Quoddy), crs=Can_proj), 'SpatialPixelsDataFrame')
Plot the (transformed) Bathymetry and Distance from Port spatial objects. We are going to combine these into a single plot using the multiplot() function from the inlabru package. This function takes as input ggplot objects and an argument layout, specifying how the plots should be arranged (see Additional tips for ways to change the layout).
multiplot(ggplot() +
gg(Domain) +
gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry') +
coord_fixed(ratio = 1),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') +
coord_fixed(ratio = 1),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)')+
coord_fixed(ratio = 1),
layout=matrix(1:4, nrow=2, ncol=2, byrow = TRUE))
Don’t like the colour scheme? See Additional tips to learn how to define your own manually.
In it’s current form, the domain is too large for the application. The domain extends far beyond the survey tracklines and the whale watch ports. To combat this, we will create a new domain that is restricted to lie within 30km of the nearest trackline.
# Create a set of SpatialPixels across the domain
pixels <- SpatialPoints(makegrid(Domain, n=5000),proj4string = Domain@proj4string)
# Convert the SpatialPixels into a SpatialPoints object before subsetting
#pred_int_points <- as(pred_int,'SpatialPoints')
# Subset to 30 km from Effort lines - can be slow
pixels_restricted <-
pixels[which(apply(gWithinDistance(pixels, Effort_survey,dist = 30,byid = T),2,sum)>0),]
# Select only the points that lie within the Domain
pixels_restricted <- pixels_restricted[Domain,]
# Convert to SpatialPolygons via SpatialPixels
pixels_restricted <- as(as(pixels_restricted,'SpatialPixels'),'SpatialPolygons')
# Merge the SpatialPolygons into a single Domain polygon using gUnionCascaded
pixels_restricted <- gUnionCascaded(pixels_restricted)
# Plot to check the new domain
ggplot() +
gg(Bathym) +
gg(pixels_restricted) +
xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry') +
coord_fixed(ratio = 1)
# Smooth the Domain using gSimplify
Domain_restricted <- gSimplify(pixels_restricted,tol=20)
# Plot
ggplot() +
gg(Domain_restricted) +
gg(Domain, color='red') +
gg.spatiallines_mod(Effort_survey, colour='blue')
Whilst the new SpatialPolygons object
Domain_restricted no longer extrapolates far beyond the survey tracklines, the smoothing achieved by the function gSimplify() has caused the new Domain to no longer contain all the tracklines. To counter this, we will buffer the study region prior to simplifying using gBuffer.
Domain_restricted <- gSimplify(gBuffer(pixels_restricted,width=15),tol=20)
# Plot
ggplot() +
gg(Domain_restricted) +
gg(Domain, color='red') +
gg.spatiallines_mod(Effort_survey, colour='blue')
## Regions defined for each Polygons
# Does the new domain contain all the survey tracklines?
gContains(Domain_restricted,Effort_survey)
## [1] TRUE
Great! We have defined a new buffered and smoothed domain that will no longer extrapolate far beyond our survey tracklines. To restore the original coastline definition we can use the function gIntersection(). We now update the SpatialPolygons object Domain using gIntersection():
Domain <- gIntersection(Domain_restricted,Domain)
ggplot() + gg(Domain) + gg.spatiallines_mod(Effort_survey)
Note that we will need to use Domain_restricted for defining our computational mesh later in the workshop! The smoothing of the coastline will help us to define a computational mesh that has fewer numerical issues. More on that later!
We will need to use Domain_restricted for defining our computational mesh later in the workshop! The package inlabru requires that our covariates are also defined at every point in the new domain. This will require extending our covariates to take ‘sensible’ values at points beyond our original domain (Domain). We will see that these imputed values will not affect the model estimates.
Our covariate depth is well-defined on the original (un-buffered) domain as defined by the SpatialPolygons Domain. At values outside this, we choose to ‘fill-in’ these points with nearest neighbour imputations. The code below does this for depth.
## Redefine log_Depth across our modified domain
# Create the log depth and log slope covariates
log_Depth <- Bathym
log_Depth$FIWH_MAR_Bathymetry[log_Depth$FIWH_MAR_Bathymetry >= 0] <- 0
log_Depth$FIWH_MAR_Bathymetry <- log(1-log_Depth$FIWH_MAR_Bathymetry)
names(log_Depth) <- 'log_Depth'
# 1) Define a set of pixels across our modified domain
pixels_Domain <- as(SpatialPoints(makegrid(Domain_restricted, n=100000),proj4string = Domain@proj4string),'SpatialPixels')
# 2) Extract values of the covariate at the new pixel locations
pixels_Domain$log_Depth <- over(pixels_Domain,log_Depth)$log_Depth
# 3) impute missing values with the nearest neighbour value
pixels_Domain$log_Depth[is.na(pixels_Domain$log_Depth)] <-
log_Depth$log_Depth[nncross(as(SpatialPoints(pixels_Domain@coords[which(is.na(pixels_Domain$log_Depth)),]),'ppp'),
as(SpatialPoints(log_Depth@coords),'ppp'),
what = 'which')]
log_Depth <- pixels_Domain
# 4) Create a squared log depth covariate (for later)
log_Depth_sq <- log_Depth
names(log_Depth_sq) <- 'log_Depth_sq'
log_Depth_sq$log_Depth_sq <- log_Depth_sq$log_Depth_sq^2
# Plot the covariates with Domain_restricted overlayed
ggplot() + gg(log_Depth) + gg(Domain_restricted)
In the first lecture, we defined the species’ intensity surface \(\lambda_{true}(s)\). We interpreted this as an ‘expected encounter rate per unit effort around s’. We then attempt to estimate/approximate an effort surface \(\lambda_{eff}(s)\). Both of these surfaces are linked to a set of covariates (and later we’ll even include random fields).
For well-designed surveys, e.g. distance-sampling surveys, we have additional data available on the perpendicular distance of the encountered animal to the observer. We saw that we can use this to jointly model a detection probability function \(p(d)\). inlabru, the package we will use in the later workshops, handles all of the neccessary integrations required to jointly model the encounter locations with the encounter distances.
Finally, we defined the observed intensity of the encounters \(\lambda_{obs}(s)\) as ‘the expected density of encounters per unit area around point s’. If our approximation of effort (\(\lambda_{eff}(s)\)), and the detection probability function \(p(d)\) is good, then we hope to see: \[\lambda_{obs}(s) \approx \lambda_{true}(s) \lambda_{eff}(s)\]
or, for a single survey trackline: \[\lambda_{obs}(s) \approx \lambda_{true}(s) \lambda_{eff}(s) p(d(s))\]
Later, we will be fitting spatial point process models to the encounters. A requirement for using these models is that, conditional on \(\lambda_{eff}(s)\), each encounter with the animal is an independent snapshot from its true intensity \(\lambda_{true}(\textbf{s})\).
Planned surveys can be designed to (approximately) satisfy the independence assumption. The boat speed can be fixed at a high value relative to the individuals’ speeds (no double-counting) and a narrow perpendicular field-of-view can be enforced. Then, assuming that the individuals of the target species are in equilibrium with respect to their stationary distribution \(\lambda_{true}(\textbf{s})\) (i.e. the species’ density) and locally mixing faster than the time between revisits by observers, the independence assumption may reasonably hold.
Conversely, no such assumptions can be made regarding the whale-watch sightings. The unknown positions of the whale-watching boats, the 360 degree fields-of-view on-board, the variable boat speed, and the repeated sightings of the same individuals throughout each day can all invalidate the independence assumption required to fit the desired models.
In our workshop, we will see some potential solutions to address this issue of independence. In this session, we will perform some exploratory analyses to investigate the suitability of the independence assumption, and to investigate possible ways of estimating observer effort.
Let’s look at the whale-watching data. Specifically, let’s investigate the 2nd, 3rd and 4th sightings in the dataset. Notice that they were made on the same day and very close together in time. Often, little information is provided on the database management procedures. For example, are these multiple encounters with the same individual? Would the database discard or retain problematic repeated sightings?
Sightings_DRWW_sp@data[2:4,c('WS_TIME','WS_DATE','PLATFORM')]
## # A tibble: 3 x 3
## WS_TIME WS_DATE PLATFORM
## <Period> <dttm> <chr>
## 1 16H 5M 0S 2007-06-19 00:00:00 BRIER ISLAND WHALEWATCH
## 2 16H 23M 0S 2007-06-19 00:00:00 BRIER ISLAND WHALEWATCH
## 3 16H 32M 0S 2007-06-19 00:00:00 BRIER ISLAND WHALEWATCH
Without additional information available on the database management protocols, we can investigate whether these sightings could be of the same individual. To investigate this hypothesis, we can compute the distance between the sightings using the gDistance() function, the time between the sightings, and the required travel speed of the individual under the assumed hypothesis.
# How far away are these sightings in space (in kilometers)?
gDistance(Sightings_DRWW_sp[2:4,], byid = T)
## 1 2 3
## 1 0.000000 1.316181 3.637082
## 2 1.316181 0.000000 2.386592
## 3 3.637082 2.386592 0.000000
# How far away are these sightings in time (in minutes)?
Sightings_DRWW_sp[3:4,]$WS_TIME-Sightings_DRWW_sp[2:3,]$WS_TIME
## [1] "18M 0S" "9M 0S"
# Could this be the same animal? How fast is this implied movement?
gDistance(Sightings_DRWW_sp[2:4,], byid = T)[cbind(c(1,2),c(2:3))] /
(c(18, 9)/60)
## [1] 4.38727 15.91061
This hypothesis would imply a traveling speed of between 4km/h and 16km/h. Is this plausible? Since fin whales are known to sustain much higher speeds (> 30km/h), we will discard all repeated sightings each day. We will keep only the first sighting made each day by each company.
Note that there is no overlap between the sighting locations from the two whale-watch companies, indicating that an assumption of independent encounters between the two companies could be reasonable.
Sightings_DRWW_sp$PLATFORM <- as.factor(Sightings_DRWW_sp$PLATFORM)
ggplot() + gg(Domain) + gg(Sightings_DRWW_sp, colour=Sightings_DRWW_sp$PLATFORM_CODE)
Thus we subset the data to keep only the initial sightings from each company:
Sightings_Brier_nodup <- Sightings_DRWW_sp[Sightings_DRWW_sp$PLATFORM=='BRIER ISLAND WHALEWATCH',]
Sightings_Brier_nodup <- Sightings_Brier_nodup[!duplicated(Sightings_Brier_nodup$WS_DATE),]
Sightings_Quoddy_nodup <- Sightings_DRWW_sp[Sightings_DRWW_sp$PLATFORM=='QUODDY LINK',]
Sightings_Quoddy_nodup <- Sightings_Quoddy_nodup[!duplicated(Sightings_Quoddy_nodup$WS_DATE),]
dim(Sightings_DRWW_sp)[1]; dim(Sightings_Brier_nodup)[1]; dim(Sightings_Quoddy_nodup)[1];
## [1] 567
## [1] 85
## [1] 144
Notice that we have a total of 229 whale watch sightings after the subsetting procedure, down from an original count of 567 sightings.
Strictly speaking, for the desired point process models to be reasonable, it is required that the initial daily encounter locations from the whale watch vessels are independent realizations from \(\lambda_{true}\), conditioned on \(\lambda_{eff}\). We will assume that the typical journeys/routes of the whale watch vessels remains constant across months and years under study. Thus, we assume that \(\lambda_{eff}\) is constant through time.
As we discussed earlier, we are unable to model temporal changes in the species density due to the spatially non-overlapping survey efforts across the months and years. Thus any changes in the species intensity from June - August, or across the years 2007-2009 will be missed. Thus, we are forced to assume that the summer species density remains constant between 2007-2009. It is the summer species density that is our target for inference.
How reasonable is our assumption that the species density remains constant across the years and across the months? Again, we can visually assess the suitability of the assumption with a plot.
If each whale watch sighting is indeed an independent realisation from a temporally static point process, then there should be no association between the spatial distances between the whale watch sightings and how far apart in time they were made.
To plot distances against time intervals, we first compute the distances in time and the euclidean distances in space between each sighting.
difftime_Brier<- dist(as.numeric(ymd(Sightings_Brier_nodup$WS_DATE)))
diffspace_Brier <- dist(Sightings_Brier_nodup@coords)
difftime_Brier_fac <- as.factor(difftime_Brier)
difftime_Quoddy <- dist(as.numeric(ymd(Sightings_Quoddy_nodup$WS_DATE)))
diffspace_Quoddy <- dist(Sightings_Quoddy_nodup@coords)
difftime_Quoddy_fac <- as.factor(difftime_Quoddy)
We will create multiple plots with days on the x-axis and km on the y-axis. The first plot restricted to only consider temporal differences of less than 85 days. We are plotting a sequence of boxplots to help detect patterns.
ggplot(data=data.frame(difftime =
difftime_Brier_fac[as.numeric(difftime_Brier)<85],
diffspace = as.numeric(diffspace_Brier)[as.numeric(difftime_Brier)<85]),
aes(y=diffspace, group=difftime)) +
geom_boxplot() +
xlab('difference in time (days)') +
ylab('difference in space (km)')
The second plot is a different version, which uses a loess smoother curve to try to help detect the same patterns.
ggplot(data=data.frame(difftime = as.numeric(difftime_Brier)[as.numeric(difftime_Brier)<85],
diffspace = as.numeric(diffspace_Brier)[as.numeric(difftime_Brier)<85]),
aes(x=difftime, y=diffspace)) +
geom_point() + geom_smooth(method='loess') +
xlab('difference in time (days)') +
ylab('difference in space (km)')
These two first plots indicate some moderate autocorrelation between the sightings, with encounters made ~1 month apart typically being around 50% further away than those made 24 hours apart. Notice that the trend plateaus after ~14 days. Perhaps this trend is due to the autocorrelation being caused exclusively by persistence in animal movement? Furthermore, the flat-line trend between day ~30 and day ~75 could be evidence in favour of an approximately constant density across the summer months. If the space use did indeed change dramatically across the months (e.g. due to migration), then we would expect to see a monotonically increasing trend over time.
The third plot investigates the space-time relationship across years.
ggplot(data=data.frame(difftime = as.numeric(difftime_Brier)[],
diffspace = as.numeric(diffspace_Brier)[]),
aes(x=difftime, y=diffspace)) +
geom_point() + geom_smooth(method='loess')+
xlab('difference in time (days)') +
ylab('difference in space (km)')
This third plot shows no evidence that the average spatial separation between whale watch sightings differs year-to-year. This supports our assumption that the spatial density can be assumed to be constant across the years, albeit within the (very) small spatial region visited by the whale watch boats departing from Brier Island.
We now repeat the plots, but for the whale watch boats that depart from Quoddy.
ggplot(data=data.frame(difftime = as.numeric(difftime_Quoddy)[as.numeric(difftime_Quoddy)<85],
diffspace = as.numeric(diffspace_Quoddy)[as.numeric(difftime_Quoddy)<85]),
aes(x=difftime, y=diffspace)) +
geom_point() + geom_smooth(method='loess')
ggplot() + gg(Domain) + gg(Sightings_DRWW_sp, colour=Sightings_DRWW_sp$PLATFORM_CODE)
ggplot(data=data.frame(difftime = as.numeric(difftime_Quoddy)[as.numeric(difftime_Quoddy)<85 & as.numeric(diffspace_Quoddy)<50],
diffspace = as.numeric(diffspace_Quoddy)[as.numeric(difftime_Quoddy)<85 & as.numeric(diffspace_Quoddy)<50]),
aes(x=difftime, y=diffspace)) +
geom_point() + geom_smooth(method='loess')
ggplot(data=data.frame(difftime =
difftime_Quoddy_fac[as.numeric(difftime_Quoddy)<85 & as.numeric(diffspace_Quoddy)<50],
diffspace = as.numeric(diffspace_Quoddy)[as.numeric(difftime_Quoddy)<85 & as.numeric(diffspace_Quoddy)<50]),
aes(y=diffspace, group=difftime)) +
geom_boxplot() +
xlab('difference in time (days)') +
ylab('difference in space (km)')
In the first plot we see a large separation in the y-axis. This is caused by a cluster of sightings made way East of port. This is seen in the second plot as the right-most cluster of blue points. To stop this from impacting our estimate of trend, we restrict the spatial differences and times to be those within each of the two clusters of sightings. The third plot displays the results from doing this. Once again, we see some potential evidence of autocorrelation upto ~14 days. This could be due to persistence of animal movement, for example, due to a few whales remaining in the same area for a few days.
We have seen some evidence that residual autocorrelations remain in the sightings data. This is despite subsetting the data to the initial daily sightings. The result of this may be overdispersion. We may be able to partially correct for this by using of a log-Gaussian Cox process (introduced in lecture 2). By including latent effects (e.g. spatial fields) in the model, we can hopefully capture some of this additional clustering. In fact, log-Gaussian Cox processes are always overdispersed relative to Poisson processes.
However, trying to determine which latent effects are really describing \(\lambda_{true}\) and which are simply capturing the movement persistence is problematic! For example, if we falsely attribute too much variability and clustering to persistence, then we risk underestimating the uncertainty with our estimates of \(\lambda_{true}\)! This could lead to over-confident inference.
It is common for multiple competing models to be fit. Deciphering which model fits the data ‘best’ using information criterion alone (AIC, DIC, etc.,) can be problematic. Performing model comparisons using the same dataset as were used to fit the models and perform exploratory analysis can be problematic (overfitting can occur). An alternative approach is to test both the predictive accuracy and the estimated uncertainty on an unseen dataset.
We now subset the data into a training and test set. We choose the training set to contain all the sightings from 2007-2009. For the test data, we choose all the sightings from 2011. See Additional tips for an explanation of why the potential repeated sighting in 2011 between survey and whale-watch datat, means that the whale-watching data cannot be added to the training dataset. Note that we perform all remaining exploratory analysis on the training data from hereon out.
Below, we split the data into training and test data.
Sightings_Brier_nodup_test <- Sightings_Brier_nodup[Sightings_Brier_nodup$YEAR==2011,]
Sightings_Brier_nodup <- Sightings_Brier_nodup[Sightings_Brier_nodup$YEAR!=2011,]
Sightings_Quoddy_nodup_test <- Sightings_Quoddy_nodup[Sightings_Quoddy_nodup$YEAR==2011,]
Sightings_Quoddy_nodup <- Sightings_Quoddy_nodup[Sightings_Quoddy_nodup$YEAR!=2011,]
Sightings_survey_test <- Sightings_survey[Sightings_survey$YEAR==2011,]
Sightings_survey <- Sightings_survey[Sightings_survey$YEAR!=2011,]
Effort_survey_test <- Effort_survey[Effort_survey$YEAR=='2011',]
Effort_survey <- Effort_survey[Effort_survey$YEAR!='2011',]
# How many survey sightings by year?
table(Sightings_survey$YEAR)
##
## 2007 2008 2009
## 45 17 1
Next, we evaluate a crude estimate of the total amount of survey effort spent each year, as measured by the total trackline length. Using this, we then crudely estimate a ‘catch per unit effort’ (CPUE) measure by year.
# How much survey effort (in trackline length) is there by year
by(gLength(Effort_survey,byid = T), Effort_survey$YEAR, sum)
## Effort_survey$YEAR: 2007
## [1] 11643.15
## ------------------------------------------------------------
## Effort_survey$YEAR: 2008
## [1] 571.9083
## ------------------------------------------------------------
## Effort_survey$YEAR: 2009
## [1] 819.9337
# Crude estimate of relative `CPUE'
(table(Sightings_survey$YEAR) /
by(gLength(Effort_survey,byid = T), Effort_survey$YEAR, sum))/
min(table(Sightings_survey$YEAR) /
by(gLength(Effort_survey,byid = T), Effort_survey$YEAR, sum))
##
## 2007 2008 2009
## 3.168989 24.372565 1.000000
# How many WW sightings by year
table(Sightings_Quoddy_nodup$YEAR)
##
## 2007 2008 2009
## 40 24 44
table(Sightings_Brier_nodup$YEAR)
##
## 2007 2008 2009
## 24 19 19
The ‘CPUE’ values are highly variable! The wild differences could be due to the small efforts in years 2008 and 2009 relative to 2007. Alternatively, the differences could actually reflect real differences in whale density across the three distinct regions visited each year! Without taking a model-based approach and controlling for differences in spatial effort, it is unclear how to combine the sightings from these three surveys.
As we will see in the next tutorial, by taking a model-based approach to inference, we will be able to combine the data from all three surveys to estimate the whale density across space. Unfortunately, due to the lack of spatial overlap between the surveys’ efforts (tracklines), we are unable to account for any differences between the survey protocols within a model without strong prior information available. For example, an intercept added to the model for each survey would be confounded by the spatial field. From now on, we assume that each survey is equivalent.
We will investigate trends between the whale density and the covariate bathymetry. We use only the training data.
A quick first step is to evaluate whether the empirical densities of each covariate at the encounter locations (representing habitat used by the whale) differ from the empirical density across the domain (representing available habitat values). Such difference may indicate that the whales associated with a subset of the available habitat (or indicate differences in observer coverage).
To extract the covariate values at each location, we will convert the SpatialPixelsDataFrame objects into objects of class im from the spatstat package. Then, after creating a ppp point pattern object containing the sightings locations from all sources, we can easily extract the values of each covariate at each point location1.
# Convert covariates to im class
Bathym_im <- as(as(Bathym, 'SpatialGridDataFrame'),'im')
# Convert Observation locations to ppp object
All_obs_ppp <- ppp(x=c(Sightings_survey@coords[,1],
Sightings_Brier_nodup@coords[,1],
Sightings_Quoddy_nodup@coords[,1]),
y=c(Sightings_survey@coords[,2],
Sightings_Brier_nodup@coords[,2],
Sightings_Quoddy_nodup@coords[,2]),
marks=
as.factor(c(rep('survey',dim(Sightings_survey@coords)[1]),
rep('Brier',dim(Sightings_Brier_nodup@coords)[1]),
rep('Quoddy',dim(Sightings_Quoddy_nodup@coords)[1]))),
window = as(Domain, 'owin'))
# Notice how easy it is to extract the values of bathymetry at these locations!
Bathym_obs <- Bathym_im[All_obs_ppp]
First, we plot the empirical densities of the Bathymetry at the sightings locations at throughout the domain (study area). Let’s only consider values in water (max bathym is 313, so restrict to < 0).
ggplot(data=data.frame(Bathym_obs = Bathym_obs),
aes(x=Bathym_obs, colour="Sightings")) +
geom_density() +
geom_density(data=data.frame(Bathym_dom = as.numeric(Bathym_im$v[Bathym_im$v<0])),
aes(x=Bathym_dom, colour='Domain')) + xlab('Bathymetry (m)') +
scale_color_manual(name = "Density", values = c('Sightings' = 'black', 'Domain' = 'red'))
We see a very skew distribution of Bathymetry with a huge range of values. Leaving the covariate in this form risks wild predictions of intensity being made! A log transform is desired. We also flip the sign to make it interpretable as log depth (larger is deeper).
ggplot(data=data.frame(Bathym_obs = log(1-Bathym_obs)),
aes(x=Bathym_obs, colour="Sightings")) +
geom_density() +
geom_density(data=data.frame(Bathym_dom = log(as.numeric(1-Bathym_im$v[Bathym_im$v<0]))),
aes(x=Bathym_dom, colour="Domain")) +
geom_vline(xintercept = 0, colour='blue') + xlab('log Depth') +
scale_color_manual(name = "Density", values = c('Sightings' = 'black', 'Domain' = 'red'))
Perhaps there is evidence of whales preferring areas of shallower waters?
These plots may suggest that sightings locations are associated with regions with shallower waters compared what is available in the study area. However, these results are really investigating the associations between \(lambda_{obs}\) and the covariates and not the desired associations between \(lambda_{true}\) and the covariates. The associations may simply reflect the types of regions preferentially visited by observers! Remember - most encounters are made by the whale watch vessels close to port!
In lecture 1, we introduced the simplest distributional assumption on the counts. We assumed that the distribution of the number of encounters within any region \(A\) was Poisson distributed, with mean equal to the integrated intensity surface: \[N(Y \cap A) \backsim Poisson(\int \lambda_{obs}(s) ds)\].
We will see in lecture 2, that if we define a set of regions \(A_i\) to be a grid that covers \(\Omega\), if we increase the number of cells the likelihood of the Poisson counts converges to the likelihood of the Inhomogeneous Poisson process.
So, to investigate trends between the species intensity and the covariates, we fit Poisson point process models to the sightings data using the ppm() function from the spatstat package. These are similar to Maxent models. Ignoring effort for the time being, what does a Poisson point process model tell us?
# Transforming the covariate
log_Bathym_im <- Bathym_im
log_Bathym_im[log_Bathym_im >= 0] <- -1
log_Bathym_im$v <- log(1-log_Bathym_im$v)
# Fitting the Poisson point process model
Ppm_1 <- ppm(All_obs_ppp~log_Bathym_im,
data=list(log_Bathym_im=log_Bathym_im,
Domain=as(Domain, 'owin')),
subset = Domain)
summary(Ppm_1)
## Point process model
## Fitting method: maximum likelihood (Berman-Turner approximation)
## Model was fitted using glm()
## Algorithm converged
## Call:
## ppm.formula(Q = All_obs_ppp ~ log_Bathym_im, data = list(log_Bathym_im = log_Bathym_im,
## Domain = as(Domain, "owin")), subset = Domain)
## Edge correction: "border"
## [border correction distance r = 0 ]
## --------------------------------------------------------------------------------
## Quadrature scheme (Berman-Turner) = data + dummy + weights
##
## Data pattern:
## Marked planar point pattern: 232 points
## Average intensity 0.000797 points per square unit
## Multitype:
## frequency proportion intensity
## Brier 62 0.267 0.000213
## Quoddy 107 0.461 0.000368
## survey 63 0.272 0.000217
##
## Window: polygonal boundary
## 2 separate polygons (1 hole)
## vertices area relative.area
## polygon 1 1389 291110.00 1.000000
## polygon 2 (hole) 38 -179.42 -0.000617
## enclosing rectangle: [112.9539, 1025.5] x [4544.234, 5259.268] units
## (912.5 x 715 units)
## Window area = 290931 square units
## Fraction of frame area: 0.446
##
## Dummy quadrature points:
## 64 x 64 grid of dummy points, plus 4 corner points
## dummy spacing: 14.25853 x 11.17240 units
##
## Original dummy parameters: =
## Marked planar point pattern: 6248 points
## Average intensity 0.0215 points per square unit
## Multitype:
## frequency proportion intensity
## Brier 2100 0.336 0.00721
## Quoddy 2050 0.329 0.00706
## survey 2100 0.336 0.00721
##
## Window: polygonal boundary
## 2 separate polygons (1 hole)
## vertices area relative.area
## polygon 1 1389 291110.00 1.000000
## polygon 2 (hole) 38 -179.42 -0.000617
## enclosing rectangle: [112.9539, 1025.5] x [4544.234, 5259.268] units
## (912.5 x 715 units)
## Window area = 290931 square units
## Fraction of frame area: 0.446
## Quadrature weights:
## (counting weights based on 64 x 64 array of rectangular tiles)
## All weights:
## range: [1.1, 159] total: 870000
## Weights on data points:
## range: [1.1, 79.7] total: 5550
## Weights on dummy points:
## range: [1.1, 159] total: 864000
## --------------------------------------------------------------------------------
## FITTED MODEL:
##
## Nonstationary multitype Poisson process
## Possible marks:
## Brier Quoddy survey
## ---- Intensity: ----
##
## Log intensity: ~log_Bathym_im
## Model depends on external covariate 'log_Bathym_im'
## Covariates provided:
## log_Bathym_im: im
## Domain: owin
##
## Fitted trend coefficients:
## (Intercept) log_Bathym_im
## -5.6516115 -0.5267851
##
## Estimate S.E. CI95.lo CI95.hi Ztest Zval
## (Intercept) -5.6516115 0.22612812 -6.0948144 -5.2084085 *** -24.99296
## log_Bathym_im -0.5267851 0.04877631 -0.6223849 -0.4311853 *** -10.80002
##
## ----------- gory details -----
##
## Fitted regular parameters (theta):
## (Intercept) log_Bathym_im
## -5.6516115 -0.5267851
##
## Fitted exp(theta):
## (Intercept) log_Bathym_im
## 0.003511853 0.590500297
Looking at the table of coefficients and Ztest symbols, we find some evidence that the whales may be found more often in regions with shallower waters. Ignoring effort, we find a significant association (p < 0.05 indicated by at least one star in the Ztest column) between \(\lambda_{obs}\) and log depth.
Next, we can attempt to make a crude adjustment for effort and see if these estimates change. We have already created a SpatialPixelsDataFrame object containing distance from the two whale-watch ports. Next, we can create a SpatialPixelsDataFrame containing the perpendicular distances to the nearest trackline. This fails to account for overlap in survey tracklines, but offers a reasonable first approach to crudely control for effort for the purposes of exploratory analysis.
Dist_Surveylines <- as(Dist_Brier,'SpatialPoints')
Dist_Surveylines$Dist_surveylines <- as.numeric(apply(gDistance(Dist_Surveylines,
Effort_survey,byid = T),2,min))
Dist_Surveylines <- as(Dist_Surveylines,'SpatialPixelsDataFrame')
We can then map the distance to the nearest trackline and plot the histogram of nearest distances.
colsc <- function(...) {
scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(11,"RdYlBu")),
limits = range(...))
}
# Map the distance through space
ggplot() + gg(Domain) + gg(Dist_Surveylines) + colsc(Dist_Surveylines@data[,1])
# What is the maximum detection distance?
ggplot(Sightings_survey@data, aes(x=DISTANCE)) + geom_histogram()
The histogram of the perpendicular distances from the aircraft shows an apparent maximum detection probability at a value of distance above 0! This is common for aerial surveys. Given the limit size of the survey data chosen for the workshop, and the potential numerical issues we may face fitting a complex detection probability function, we will fix all values less than 250m equal to 250m. This crude approach will help to make the histogram appear as a monotonically decreasing function of distance, satisfying the simplest class of detection probability functions. This shouldn’t be done in practice! A 2-parameter detection probability function could instead be fit.
In addition, it is often advised in distance sampling applications to threshold our upper detection distances at a ‘sensible’ value. Let’s plot our observed distances from the trackline. It looks like 2km could be a reasonable threshold value to choose. Let’s set all values above 2km equal to 2km and scale the distances onto the km units scale. This rescaling to units of km is crucial for numerical stability. We define a variable ‘distance’ that contains the rescaled distances. Note that a few distances are missing. We impute these with the mean distance.
Sightings_survey@data$DISTANCE <- ifelse(Sightings_survey@data$DISTANCE>250,Sightings_survey@data$DISTANCE,250)
Sightings_survey@data$DISTANCE <-
ifelse(Sightings_survey$DISTANCE>2000 & !is.na(Sightings_survey$DISTANCE),
2000,Sightings_survey$DISTANCE)
# needs renaming to match the formula argument
Sightings_survey$distance <- Sightings_survey$DISTANCE
# There are a couple of missing distances. Impute these with the mean
Sightings_survey$distance[is.na(Sightings_survey$distance)] <- mean(Sightings_survey$distance,na.rm=T)
Sightings_survey <- Sightings_survey[,-c(8)]
Sightings_survey$distance <- Sightings_survey$distance / 1000
ggplot(Sightings_survey@data, aes(x=distance)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram perhaps shows an initial steady decline in sightings with distance, before decaying more quickly as distance increases. Again, we need to remember that this plot is showing associations between distance \(d\) and \(\lambda_{obs}\) and not with \(p(d)\). To capture this shape, we will use a half-norm detection function in the Poisson point process model: \[p(d) = exp\left(\frac{-d^2}{2\sigma^2}\right)\] This requires us to square the values of distance from the trackline.
Now, we investigate a potential functional form for the distance from port covariate. To do this, we plot a histogram of the distances from port at each of the whale watch sighting locations.
# Plot the sightings with distance from the port
hist(over(Sightings_Brier_nodup,Dist_Brier)$Dist_Brier, main='Histogram of the distance from port of the Brier sightings', xlab = 'Distance from port')
hist(over(Sightings_Quoddy_nodup,Dist_Quoddy)$Dist_Quoddy,breaks=20, main='Histogram of the distance from port of the Quoddy sightings', xlab = 'Distance from port')
For both ports, we detect a decreasing frequency of sightings made as the distance from port increases. This relationship is clearer for Brier sightings than for Quoddy sightings. Furthermore, the functional form of the effect does not appear to be an exponential decay. Based on the histograms, we choose to model the functional form as a half-normal function. More complicated functional forms (e.g. weibull or hazard functions) could be used.
On the log scale, this covariate can be estimated as a linear effect on the squared distance from port. Thus, we create SpatialPixelsDataFrame objects which store the squared distances from port.
We assume the same functional relationship between each whale watch effort intensity \(\lambda_{eff, i}\) and distance from their port location \(s_i^*\). Here \(i\) indicates the port and \(s_i^*\) denotes the \(i^{th}\) port location. We scale the half-norm functions by a unique constant \(\lambda_i\), to capture the different number of vessels across the two ports: \[\lambda_{eff,i}(s) = \lambda_i exp\left(\frac{-||s-s_i^*||^2}{2\sigma_i^2} \right)\].
Dist_Surveylines$Dist_surveylines <- scale(Dist_Surveylines$Dist_surveylines^2)
Dist_Brier_sq <- as(Dist_Brier,'SpatialPixels')
Dist_Brier_sq$Dist_Brier_sq <- scale(Dist_Brier$Dist_Brier^2)
Dist_Quoddy_sq <- as(Dist_Quoddy,'SpatialPixels')
Dist_Quoddy_sq$Dist_Quoddy_sq <- scale(Dist_Quoddy$Dist_Quoddy^2)
Now we fit a Inhomogeneous Poisson point process model. As we are specifying the model on the log intensity, we need to include unique intercept parameters \(\alpha_{0,i} : i \in \{1,2\}\) for the two whale watch port intensities (\(\alpha_{0,i} = log \lambda_i\)) and three unique parameters \(\alpha_1\) and \(\alpha_{1,i} : i \in \{1,2\}\) defining the shapes of the three half norm functions (\(\alpha_{1,i} = \frac{-1}{2\sigma_i^2}\) and \(\alpha_1 = \frac{-1}{2\sigma^2}\)).
We must also inform spatstat about the source of each sighting, by fixing the value of its mark.
# All_obs_ppp$marks <-
# as.factor(c(rep('survey',dim(Sightings_survey@coords)[1]),
# rep('Brier',dim(Sightings_Brier_nodup@coords)[1]),
# rep('Quoddy',dim(Sightings_Quoddy_nodup@coords)[1])))
Ppm_2 <- ppm(All_obs_ppp~log_Bathym_im +
I(ifelse(marks=='Brier',1,0)):Dist_Brier_im +
I(ifelse(marks=='Quoddy',1,0)):Dist_Quoddy_im +
I(ifelse(marks=='survey',1,0)):Dist_Survey_im +
marks,
data=list(log_Bathym_im=log_Bathym_im,
Dist_Brier_im = as(as(Dist_Brier_sq,'SpatialGridDataFrame'),'im'),
Dist_Quoddy_im = as(as(Dist_Quoddy_sq,'SpatialGridDataFrame'),'im'),
Dist_Survey_im = as(as(Dist_Surveylines,'SpatialGridDataFrame'),'im'),
Domain=as(Domain, 'owin')),
subset = Domain)
coef(summary(Ppm_2))
## Estimate S.E.
## (Intercept) -977.9745269 123.64065586
## log_Bathym_im 0.4862812 0.08148644
## marksQuoddy 567.7490421 126.94313145
## markssurvey 697.5703053 133.08577596
## I(ifelse(marks == "Brier", 1, 0)):Dist_Brier_im -819.1401759 104.13508711
## I(ifelse(marks == "Quoddy", 1, 0)):Dist_Quoddy_im -296.8180231 23.86004018
## I(ifelse(marks == "survey", 1, 0)):Dist_Survey_im -657.8434500 120.96619940
## CI95.lo CI95.hi
## (Intercept) -1220.3057594 -735.6432944
## log_Bathym_im 0.3265707 0.6459917
## marksQuoddy 318.9450764 816.5530078
## markssurvey 436.7269776 958.4136331
## I(ifelse(marks == "Brier", 1, 0)):Dist_Brier_im -1023.2411962 -615.0391556
## I(ifelse(marks == "Quoddy", 1, 0)):Dist_Quoddy_im -343.5828425 -250.0532037
## I(ifelse(marks == "survey", 1, 0)):Dist_Survey_im -894.9328442 -420.7540558
## Ztest Zval
## (Intercept) *** -7.909813
## log_Bathym_im *** 5.967634
## marksQuoddy *** 4.472468
## markssurvey *** 5.241509
## I(ifelse(marks == "Brier", 1, 0)):Dist_Brier_im *** -7.866130
## I(ifelse(marks == "Quoddy", 1, 0)):Dist_Quoddy_im *** -12.439963
## I(ifelse(marks == "survey", 1, 0)):Dist_Survey_im *** -5.438242
With this rough attempt at controlling for heterogeneous effort, we see that log depth is found to be significantly associated with the whale density, with a positive coefficient. The negative effects of all three complicated expressions imply that a negative association between ‘distance’ and observed intensity is witnessed (agreeing with common sense).
We can plot the estimated ‘true’ whale intensity \(\hat{\lambda}_{true}(s)\) predicted from this model:
# remove the effects of observers for plotting
pred_coef <- coef(Ppm_2)
pred_coef[c(1,3,4,5,6,7)] <- 0
plot(predict.ppm(Ppm_2, ngrid=c(300,300),
new.coef=pred_coef)[[1]],superimpose=F,
hcl.colors(60, "YlOrRd", rev = TRUE))
Alternatively, we can also plot the fitted observed intensity surface \(\lambda_{obs} =\lambda_{true} \lambda_{effort}\)) for all observer types. We plot the observed intensity surfaces for both the survey and Quoddy-based whale watch vessels.
pred_coef <- coef(Ppm_2)
plot(log(predict(Ppm_2, ngrid=c(300,300),
new.coef=pred_coef)[[3]]),superimpose=F,
hcl.colors(60, "YlOrRd", rev = TRUE))
# Plot Quoddy
plot(log(predict(Ppm_2, ngrid=c(300,300),
new.coef=pred_coef)[[2]]),superimpose=F,
hcl.colors(60, "YlOrRd", rev = TRUE))
We see that the observed intensity, \(\lambda_{obs}\) is dominated by the effort \(\lambda_{eff}\). Despite this, we still detect significant effects of both log depth.
This exploratory analysis is useful at informing us about possible species-environment relationships to investigate in our futurer models. By playing around with different covariates and functional forms (e.g. quadratic effects), we can develop an understanding of which models to formally compare in the modelling stage.
Note that the above effort-correction approach is not the same as the one performed by inlabru later. It should only be used for exploratory purposes.
Unfortunately, by being in the inhomogeneous Poisson process class of point processes, the above models lack one crucial feature. They are unable to account for any additional spatial correlations/clustering that are frequently caused by unmeasured environmental covariates and biological processes acting behind the scenes.
This is why we turn to log-Gaussian Cox processes in the next tutorial. By assuming the intensity surface to be a realisation from a spatially-smooth (log-Gaussian) random field, we will be able to adjust our inference to take into account any additional spatial correlations and/or unmeasured covariates! By doing this, both parameter estimates and measures of uncertainty should be better adjusted to account for these autocorrelations. We will describe log-Gaussian Cox processes in depth in the next lecture.
If you got stuck on any of the exercises, then please feel free to try them again. Here are links to the problems:
over() function to extract the values of a spatial covariate (of type SpatialPixelsDataFrame) at point locations (stored as a SpatialPointsDataFrame object). Remember to subset the data accordingly!)For publication, there can be issues regarding copyright of Google Maps. Using OpenStreetMap can help. To guarantee this simply add the following argument: source='osm', force=TRUE to gmap(). Double check the console that the maps are indeed being downloaded from stamen or osm. For brevity we have suppressed the messages.
gmap(Sightings_survey, source='osm', force=TRUE) +
gg(Domain) +
gg.spatiallines_mod(Effort_survey) +
gg(Sightings_survey, colour='blue') +
gg(Sightings_DRWW_sp, colour='purple') +
gg(WW_ports, colour='red')
Note for this to work it needs to be in the original project (lat/lon), not UTM.
The multiplot() function is a very flexible function that enables publication-quality figures to be made with relative ease.
To change the order of the plots you can change the argument byrow=TRUE to byrow=FALSE.
multiplot(ggplot() +
gg(Domain) +
gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry'),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)'),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)'),
layout=matrix(1:4, nrow=2, ncol=2, byrow = FALSE))
We can also change the size of the figures by changing the matrix in the layout.
multiplot(ggplot() +
gg(Domain) +
gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry'),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)'),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)'),
layout=matrix(c(1,1,2,3), nrow=2, ncol=2, byrow = TRUE))
As you can see plots of different size stretches the maps, to keep the set projection we can use coord_fixed(ratio = 1).
multiplot(ggplot() +
gg(Domain) +
gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry') +
coord_fixed(ratio = 1),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') +
coord_fixed(ratio = 1),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)') +
coord_fixed(ratio = 1),
layout=matrix(c(1,1,2,3), nrow=2, ncol=2, byrow = TRUE))
We can define the colour palette easily.
colsc <- function(...) {
scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(11,"PuBuGn")),
limits = range(...))
}
Look at ?RColorBrewer::brewer.pal to see what other colour palettes are available.
multiplot(ggplot() +
gg(Domain) +
gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry') +
colsc(Bathym@data[,1]),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') +
colsc(Dist_Brier@data[,1]),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)') +
colsc(Dist_Quoddy@data[,1]),
layout=matrix(1:4, nrow=2, ncol=2, byrow = T))
## Warning in RColorBrewer::brewer.pal(11, "PuBuGn"): n too large, allowed maximum for palette PuBuGn is 9
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(11, "PuBuGn"): n too large, allowed maximum for palette PuBuGn is 9
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(11, "PuBuGn"): n too large, allowed maximum for palette PuBuGn is 9
## Returning the palette you asked for with that many colors
Have a go at creating your own colour palette function. Investigate the effects of changing both arguments to brewer.pal.
colsc2 <- function(...){
scale_fill_gradientn(colours = rev(RColorBrewer::brewer.pal(7,"Spectral")),
limits = range(...))
}
multiplot(ggplot() +
gg(Domain) +
gg(Bathym) + xlab('East(km)') + ylab('North(km)') + labs(fill='Bathymetry') +
colsc2(Bathym@data[,1]),
ggplot() +
gg(Domain) +
gg(Dist_Brier) + xlab('East(km)') + ylab('North(km)') +
colsc2(Dist_Brier@data[,1]),
ggplot() +
gg(Domain) +
gg(Dist_Quoddy) + xlab('East(km)') + ylab('North(km)') +
colsc2(Dist_Quoddy@data[,1]),
layout=matrix(1:4, nrow=2, ncol=2, byrow = T))
Could we not also put 2011’s whale-watch data into the training set? Below is some code showing that we cannot. A whale is sighted by at least one whale watch company on every day in 2011 that a survey detects a whale. The plot below shows that these sightings could have been of the same animal. Even if they weren’t, the tracklines in 2011 still visited areas that the whale watch vessels were present. Since our training and test datasets are required to be independent of each other, we must therefore also remove the 2011 whale watch sightings from the training data.
# what dates were survey sightings made on in 2011?
unique(Sightings_survey_test$DATE_LO[Sightings_survey_test$YEAR==2011])
## [1] "2011-06-21" "2011-08-12" "2011-08-13"
# Are sightings by the WW vessels made on these dates?
sum(grepl(Sightings_survey_test$DATE_LO[Sightings_survey_test$YEAR==2011][1],
x=c(Sightings_Quoddy_nodup_test$WS_DATE[Sightings_Quoddy_nodup_test$YEAR==2011],
Sightings_Brier_nodup_test$WS_DATE[Sightings_Brier_nodup_test$YEAR==2011])))>0
## [1] TRUE
sum(grepl(Sightings_survey_test$DATE_LO[Sightings_survey_test$YEAR==2011][2],
x=c(Sightings_Quoddy_nodup_test$WS_DATE[Sightings_Quoddy_nodup_test$YEAR==2011],
Sightings_Brier_nodup_test$WS_DATE[Sightings_Brier_nodup_test$YEAR==2011])))>0
## [1] TRUE
sum(grepl(Sightings_survey_test$DATE_LO[Sightings_survey_test$YEAR==2011][3],
x=c(Sightings_Quoddy_nodup_test$WS_DATE[Sightings_Quoddy_nodup_test$YEAR==2011],
Sightings_Brier_nodup_test$WS_DATE[Sightings_Brier_nodup_test$YEAR==2011])))>0
## [1] TRUE
# Could they be of the same animal?
ggplot() + gg(Domain) +
gg(Sightings_survey_test[Sightings_survey_test$YEAR==2011,],colour='green') +
gg.spatiallines_mod(Effort_survey_test[Effort_survey_test$YEAR==2011,],colour='yellow')
The code for hiding the Rmd code chunks came from Martin Schmelzer, found here
References/Sources for data sets:
DFO Maritimes Region Whale Sightings Database
MacDonald, D., Emery, P., Themelis, D., Smedbol, R.K., Harris, L.E., and McCurdy, Q. 2017. Marine mammal and pelagic animal sightings (Whalesightings) database: a users guide. Can. Tech. Rep. Fish. Aquat. Sci. 3244: v + 44 p.
NOAA NARW Surveys
Timothy V.N. Cole. National Oceanic Atmospheric Administration, National Marine Fisheries Service, Northeast Fisheries Science Center. 166 Water Street, Woods Hole, MA, USA
2007 TNASS DFO Aerial Survey
Lawson. J.W., and Gosselin, J.-F. 2009. Distribution and preliminary abundance estimates for cetaceans seen during Canada’s marine megafauna survey - A component of the 2007 TNASS. DFO Can. Sci. Advis. Sec. Res. Doc. 2009/031. vi + 28 p.
NOAA Cetacean Surveys
Timothy V.N. Cole and D. Palka. National Oceanic Atmospheric Administration, National Marine Fisheries Service, Northeast Fisheries Science Center. 166 Water Street, Woods Hole, MA, USA
Alternatively, we can use the function over() from the sp package. We choose to create im objects so that we can later use the ppm function from spatstat for exploratory analysis purposes. ppm fits Poisson process models, which can be equivalent to Maxent models.↩︎